Abstract

Frank-Wolfe algorithms have recently regained the attention of the Machine
Learning community. Their solid theoretical properties and sparsity
guarantees make them a suitable choice for a wide range of problems in
this field. In addition, several variants of the basic procedure exist that
improve its theoretical properties and practical performance. In this paper,
we investigate the application of some of these techniques to Machine
Learning, focusing in particular on a Parallel Tangent (PARTAN) variant
of the FW algorithm that has not been previously suggested or studied for
this type of problem. We provide experiments both in a standard setting
and using a stochastic speed-up technique, showing that the considered
algorithms obtain promising results on several medium and large-scale
benchmark datasets for SVM classification.


1 Introduction 

The Frank-Wolfe algorithm (hereafter FW) is a classical method for convex
optimization that has seen a substantial revival of interest among researchers.
Recent results have shown that the family of FW algorithms enjoys powerful
theoretical properties such as iteration complexity bounds that are independent
of the problem size, provable primal-dual convergence rates, and sparsity
guarantees that hold during the whole execution of the algorithm. Furthermore,
several variants of the basic procedure exist which can improve the
convergence rate and practical performance of the basic FW iteration.
Finally, the fact that FW methods work with projection-free iterations is an
essential advantage in applications such as matrix recovery, where a projection
step (as needed, e.g., by proximal methods) has a super-linear complexity.
As a result, FW is now considered a suitable choice for large-scale optimization
problems arising in several contexts such as Machine Learning, statistics,
bioinformatics and other fields. In the context of SVM classification, for
example, FW methods have been shown to perform well on large-scale datasets
with hundreds of thousands of examples, thus providing a promising alternative
to solvers such as Active Set methods and SMO, whose applicability is
often limited to small and medium scale problems.

In this paper, we consider the application of some well-known variants of 
the FW algorithm to Machine Learning problems, focusing in particular on a 
type of FW iteration known in the literature as PARTAN, which to the best of 
our knowledge has not previously been employed for this kind of application. 
Using several benchmark SVM datasets, we show that this variant is able to 
accelerate the standard FW method, obtaining on average a 2.52x speedup 
in CPU time. Furthermore, we show how some FW variants indeed display 
a faster convergence rate in practice using a primal-dual stopping criterion, 
though their advantage is limited when the value of the tolerance parameter is 
not too strict. Finally, to further improve running times on large problems, we
consider a random sampling speedup technique, and elaborate on its advantages
and drawbacks, in particular on the tradeoff between the cost of each
iteration and the risk of premature stopping.

Structure of the Paper 

The paper is organized as follows. Section 2 provides a general overview of the
FW method and its modifications, their theoretical properties and some
applications to Machine Learning, while in Section 3 we examine in more detail the
PARTAN variant of FW. Then, in Section 4 we perform numerical experiments
on SVM problems to assess the performance of the considered methods, and
close the paper by summarizing our conclusions in Section 5.

2 The Frank-Wolfe Method and its Variants

The FW algorithm [1] is a general method to solve optimization problems of
the form

    $$\min_{\alpha \in \Sigma} f(\alpha), \qquad (1)$$

where $f : \mathbb{R}^m \to \mathbb{R}$ is a convex differentiable function with Lipschitz continuous
gradient, and $\Sigma \subset \mathbb{R}^m$ is a compact convex set. The main idea behind the FW
iteration is to exploit a linear model of the objective function at the current
iterate to define a new search direction. In its basic form, the standard FW
algorithm can be schematized as in Algorithm 1.

2.1 Theoretical Properties 

We summarize here, for the sake of completeness, some well-known primal-dual
convergence results for the FW algorithm. Proofs of these results can be found
in the literature.

Algorithm 1 The general FW algorithm.

1: Input: an initial guess $\alpha^{(0)}$.
2: for $k = 0, 1, \ldots$ do
3:   Define a search direction $d_{FW}^{(k)} = u^{(k)} - \alpha^{(k)}$, where

       $$u^{(k)} \in \operatorname{argmin}_{u \in \Sigma} \, (u - \alpha^{(k)})^T \nabla f(\alpha^{(k)}). \qquad (2)$$

4:   Choose a stepsize $\lambda^{(k)}$, either via the line-search

       $$\lambda^{(k)} \in \operatorname{argmin}_{\lambda \in [0,1]} f(\alpha^{(k)} + \lambda d_{FW}^{(k)})$$

     or with the rule $\lambda^{(k)} = 2/(k+2)$ [2].
5:   Update:

       $$\alpha^{(k+1)} = \alpha^{(k)} + \lambda^{(k)} d_{FW}^{(k)} = (1 - \lambda^{(k)})\alpha^{(k)} + \lambda^{(k)} u^{(k)}.$$

6: end for


Proposition 1 (Sublinear convergence). Let $\alpha^*$ be an optimal solution for
problem (1). Then, for $k \ge 1$, the iterates of Algorithm 1 satisfy

    $$f(\alpha^{(k)}) - f(\alpha^*) \le \frac{4 C_f}{k+2},$$

where $C_f$ is the curvature constant of the objective function.

Choice of the stopping criterion. As a consequence of Proposition 1,
we immediately have that Algorithm 1 requires $O(1/\varepsilon)$ iterations to obtain an
$\varepsilon$-approximate solution, i.e. a solution s.t. $f(\alpha^{(k)}) - f(\alpha^*) \le \varepsilon$. However,
given that the primal gap $f(\alpha^{(k)}) - f(\alpha^*)$ is not a computable quantity, this fact
cannot be exploited directly. Instead, the stopping condition for FW algorithms
is usually based on the following duality gap criterion [2]:

    $$\Delta_d(\alpha^{(k)}) := \max_{u \in \Sigma} \, (\alpha^{(k)} - u)^T \nabla f(\alpha^{(k)}) \le \varepsilon. \qquad (3)$$

This is motivated by the fact that the duality gap provides an upper bound for
the primal gap, i.e. $f(\alpha^{(k)}) - f(\alpha^*) \le \Delta_d(\alpha^{(k)})$, while at the same time enjoying
the same asymptotic guarantees.
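
The upper-bound property follows from convexity alone. As a short sketch of the missing step, writing $\Delta_d(\alpha^{(k)})$ for the left-hand side of criterion (3) and $\Sigma$ for the feasible set of problem (1), one can argue as follows (the only facts used are the convexity of $f$ and the feasibility of $\alpha^*$):

```latex
% First-order lower bound given by convexity of f at the iterate \alpha^{(k)}:
f(\alpha^*) \;\ge\; f(\alpha^{(k)}) + (\alpha^* - \alpha^{(k)})^T \nabla f(\alpha^{(k)}).

% Rearranging, and using that \alpha^* belongs to the feasible set \Sigma:
f(\alpha^{(k)}) - f(\alpha^*)
  \;\le\; (\alpha^{(k)} - \alpha^*)^T \nabla f(\alpha^{(k)})
  \;\le\; \max_{u \in \Sigma} \, (\alpha^{(k)} - u)^T \nabla f(\alpha^{(k)})
  \;=\; \Delta_d(\alpha^{(k)}).
```

In particular, stopping as soon as the duality gap drops below the tolerance guarantees that the primal gap is below the tolerance as well.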

Proposition 2 (Primal-dual convergence). After $K \ge 2$ iterations, Algorithm 1
produces at least one iterate $\alpha^{(\tilde k)}$, $1 \le \tilde k \le K$, s.t.

    $$\Delta_d(\alpha^{(\tilde k)}) \le \frac{27 C_f}{2(K+2)}. \qquad (4)$$

From Proposition 2 it immediately follows that the $O(1/\varepsilon)$ complexity
bound holds for the duality gap $\Delta_d$ as well. Furthermore, the above results give the tolerance
parameter $\varepsilon$ a clean interpretation as a tradeoff between optimization accuracy
and overall computational complexity.

2.2 Variants on the Classical Iteration 

Though endowed with solid theoretical properties, the standard FW algorithm
exhibits a rather slow convergence rate, and is known to be prone to stagnation,
as $d_{FW}^{(k)}$ tends to become nearly orthogonal to the gradient when nearing
a solution. Solutions to this drawback date back to the 1970s, and mostly
consist in algorithmic variations where an alternative search direction is added
to avoid stalling.

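Among the most well-known variants of this kind is the Modified Frank-Wolfe
method (MFW). In the modified FW iteration, we define an
alternative search direction by maximizing the linear model:

    $$v^{(k)} \in \operatorname{argmax}_{u \in \Sigma} \, (u - \alpha^{(k)})^T \nabla f(\alpha^{(k)})$$

and then setting $d_A^{(k)} = \alpha^{(k)} - v^{(k)}$. The best descent direction is then selected,
i.e. we choose $d_A^{(k)}$ if $\nabla f(\alpha^{(k)})^T d_A^{(k)} < \nabla f(\alpha^{(k)})^T d_{FW}^{(k)}$, and stick to the standard
$d_{FW}^{(k)}$ otherwise.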

Another option is to use a pairwise (or “swap”) FW iteration as proposed in
[7], where the alternative search direction joins the vertex that maximizes the
linear model to the one that minimizes it, $d_{SW}^{(k)} = u^{(k)} - v^{(k)}$. In
this case, the choice between $d_{FW}^{(k)}$ and $d_{SW}^{(k)}$ is based on a greedy criterion, i.e.
we select the step that yields the best function value. It can be proved that the
resulting procedure enjoys properties analogous to those of the MFW algorithm.

As the specialization of these algorithms has already been presented extensively
in [7], we do not discuss them further, and refer to the literature for
implementation details. In a similar vein, other options for improving the FW
iterations, such as conjugate direction based FW or FW with optimization on
a 2-dimensional convex hull [16], are not included in this paper due to space
constraints.

From a theoretical point of view, these variants often enjoy improved
convergence guarantees. In particular, under suitable hypotheses^1, a linear
convergence rate in primal gap can be obtained, i.e. for sufficiently large $k$ we
have

    $$\frac{f(\alpha^{(k+1)}) - f(\alpha^*)}{f(\alpha^{(k)}) - f(\alpha^*)} \le M,$$

with $M \in (0,1)$ a constant.

However, to the best of our knowledge, no analogous results improving
Proposition 2 were obtained for the duality gap, meaning that there is no a
priori guarantee that a stopping criterion based on the duality gap (3) is able to
capture the improved behaviour of the algorithm. Furthermore, as the linear convergence
results are asymptotic in nature, it is not possible in general to predict whether
for a given tolerance the linear rate will kick in before the algorithm stops. Some
of the experiments in Section 4 aim precisely at investigating these issues and
their practical impact.

^1 We refer to the specialized literature for the detailed analyses, noting that the
necessary hypotheses are satisfied for the test problems used in this paper.

2.3 Applications to Machine Learning 

One of the most prominent examples of applications of FW-based algorithms to
the field of Machine Learning is given by the binary nonlinear L2-SVM training
problem:

    $$\min_{\alpha \in \mathbb{R}^m} f(\alpha) = \frac{1}{2}\alpha^T K \alpha \quad \text{s.t.} \quad \sum_{i=1}^m \alpha_i = 1, \; \alpha \ge 0. \qquad (5)$$

Here, $K$ is a positive definite kernel matrix, and the feasible set $\Sigma$ is the unit
simplex, whose vertices are the coordinate vectors $e_1, \ldots, e_m$. It is easy to see
that in this case

    $$u^{(k)} = e_{i_*^{(k)}}, \quad \text{where} \quad i_*^{(k)} \in \operatorname{argmin}_{i = 1, \ldots, m} \nabla f(\alpha^{(k)})_i. \qquad (6)$$

Though FW methods can in principle be applied to any SVM formulation giving
rise to a compact and convex feasible set, the L2-SVM is chosen here because
of its convenience. Indeed, the geometry of the unit simplex yields very simple
formulas for the key steps in the FW iteration, which from a computational
perspective leads to an extremely efficient implementation [7].

If we denote $\mathcal{I}^{(k)} = \{i \mid \alpha_i^{(k)} > 0\}$, it follows directly from (6) that at iteration
$k$ the solution can be expressed in terms of at most $k + |\mathcal{I}^{(0)}|$ data points [3, 7], or
in other words that the number of Support Vectors is bounded during the entire
run of the algorithm, which constitutes a substantial advantage of FW methods
in comparison to methods with dense iterates. This holds true in particular for
nonlinear SVM problems with datasets where the solution is sparse (in terms
of the number of SVs defining the classification model), on which the latter
suffer from the so-called “curse of kernelization” and are unable to recover the
sparsity of the solution [15]. In addition, Proposition 2 implies that the total
number of iterations is independent of the dataset size $m$. Together with the
sparsity certificate, this also implies that the memory requirement for the whole
algorithm is bounded independently of $m$.
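
To make the simplex specialization concrete, the following is a minimal sketch of the plain FW iteration for a quadratic objective of the form (5), using the vertex selection rule (6), the stepsize rule 2/(k+2) from Algorithm 1 and a duality-gap stopping test as in (3). The function name `frank_wolfe_simplex`, the dense list-of-lists kernel matrix and the iteration cap are illustrative assumptions, not the authors' C++ implementation; the sketch also records the support size at each iteration so that the bound $k + |\mathcal{I}^{(0)}|$ can be checked.

```python
def frank_wolfe_simplex(K, alpha0, eps=1e-2, max_iter=100000):
    """Plain FW for f(a) = 0.5 * a^T K a over the unit simplex (illustrative sketch)."""
    m = len(K)
    alpha = list(alpha0)
    gap = float("inf")
    support_sizes = []  # number of nonzero coordinates at the start of each iteration
    for k in range(max_iter):
        # Gradient of the quadratic objective: grad = K @ alpha.
        grad = [sum(K[i][j] * alpha[j] for j in range(m)) for i in range(m)]
        # Vertex selection on the simplex: coordinate with the smallest gradient entry.
        i_star = min(range(m), key=lambda i: grad[i])
        # Duality gap on the simplex: alpha^T grad - min_i grad_i.
        gap = sum(a * g for a, g in zip(alpha, grad)) - grad[i_star]
        support_sizes.append(sum(1 for a in alpha if a > 0))
        if gap <= eps:
            return alpha, gap, k, support_sizes
        lam = 2.0 / (k + 2)  # predefined stepsize rule
        # Convex combination of the current iterate and the selected vertex.
        alpha = [(1.0 - lam) * a for a in alpha]
        alpha[i_star] += lam
    return alpha, gap, max_iter, support_sizes


# Small made-up positive definite matrix, used purely for illustration.
K = [[2.0, 0.5, 0.0],
     [0.5, 1.0, 0.2],
     [0.0, 0.2, 3.0]]
alpha, gap, iters, sizes = frank_wolfe_simplex(K, [1.0, 0.0, 0.0])
```

Each update touches a single new coordinate, which is why the support can grow by at most one index per iteration.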

Another related problem that can be tackled with a FW method is the Lasso
problem with a 1-norm constraint

    $$\min_{\alpha \in \mathbb{R}^m} f(\alpha) := \|A\alpha - b\|_2^2 \quad \text{s.t.} \quad \|\alpha\|_1 \le t,$$

where $A \in \mathbb{R}^{n \times m}$ is a measurement matrix and $b \in \mathbb{R}^n$. In this case, the
advantage of a FW-based method would be the possibility of accurately approximating
the solution of high-dimensional problems using a reduced set of explanatory
variables. Indeed, from Proposition 1 we have that at most $O(1/\varepsilon)$ “active”
features are required to reach an $\varepsilon$-approximate optimality, independently of
the dimensionality of the feature space in which the observations have been
embedded.
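
The reason each FW iteration activates at most one feature in the Lasso setting is that a linear function over the 1-norm ball of radius t is minimized at a signed, scaled coordinate vector. A minimal sketch of this vertex search is given below; the helper name `l1_ball_vertex` and the plain-list gradient are assumptions made for illustration only.

```python
def l1_ball_vertex(grad, t):
    """Return a minimizer of u^T grad over {u : ||u||_1 <= t} (illustrative sketch).

    The minimizer puts all the mass t on the coordinate with the largest
    absolute gradient entry, with sign opposite to that entry.
    """
    j = max(range(len(grad)), key=lambda i: abs(grad[i]))
    u = [0.0] * len(grad)
    if grad[j] != 0.0:
        u[j] = -t if grad[j] > 0 else t
    return u


grad = [0.5, -2.0, 1.0]
u = l1_ball_vertex(grad, 3.0)
value = sum(ui * gi for ui, gi in zip(u, grad))  # equals -t * max_i |grad_i|
```

Since every FW update is a convex combination of the current iterate and one such vertex, the number of nonzero coefficients can grow by at most one per iteration.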

Finally, matrix recovery problems with nuclear norm regularization of the
form

    $$\min_{\alpha \in \mathbb{R}^{m \times n}} f(\alpha) := \|A(\alpha) - b\|_2^2 \quad \text{s.t.} \quad \|\alpha\|_* \le t,$$

where $A : \mathbb{R}^{m \times n} \to \mathbb{R}^p$ is a linear operator and $b \in \mathbb{R}^p$, have also been
successfully tackled with FW-based solvers [5]. The motivation here is mainly that FW
methods do not require projection steps. The solution of the linear approximation
step can be obtained in a fast way by solving a largest eigenvalue problem,
as opposed to proximal methods that require a full SVD of the gradient matrix
at each iteration, which is prohibitive for large-scale problems.

As a motivating example, we consider the SVM problem (5) for the
experiments in this paper, not only because of its significance, but also to
allow for a comparison with the results obtained in previous research efforts [20].

3 PARTAN Frank-Wolfe Iterations

Another variant of the FW algorithm that has been proposed and successfully
employed (for example, in traffic assignment applications [22, 23, 24, 16])
consists in an adaptation of the method of Parallel Tangents (PARTAN) to FW
iterations [20]. To the best of our knowledge, though, this scheme has not yet
been investigated in Machine Learning applications.

The basic idea, as seen from Figure 1, is to incorporate previous information
by performing an averaging between the classical FW step and the previous
iterate. First, an intermediate FW step is defined:

    $$\tilde\alpha^{(k)} = (1 - \lambda^{(k)})\alpha^{(k)} + \lambda^{(k)} u^{(k)}. \qquad (7)$$

Then, the previous iterate is used to define an extra search direction:

    $$\alpha^{(k+1)} = \tilde\alpha^{(k)} + \mu^{(k)} (\tilde\alpha^{(k)} - \alpha^{(k-1)}). \qquad (8)$$

Stepsizes $\lambda^{(k)}$ and $\mu^{(k)}$ can be determined via line-search.
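
As an illustration of how the two line-searches combine, the sketch below performs one PARTAN-FW iteration for a quadratic objective f(a) = 0.5 a^T K a over the unit simplex: an exact FW step towards the best vertex, followed by an exact extrapolation step along the line through the previous iterate and the intermediate point. The function names, the choice to restrict the second stepsize to nonnegative values that keep the iterate feasible, and the toy matrix are assumptions made for this example; the closed-form expressions actually used by the authors are those referred to in the Appendix.

```python
def quad(K, a):
    """Objective value f(a) = 0.5 * a^T K a."""
    m = len(a)
    return 0.5 * sum(a[i] * K[i][j] * a[j] for i in range(m) for j in range(m))


def exact_step(K, x, d, upper):
    """Minimize f(x + s * d) over s in [0, upper] for the convex quadratic f."""
    m = len(x)
    Kd = [sum(K[i][j] * d[j] for j in range(m)) for i in range(m)]
    slope = sum(x[i] * Kd[i] for i in range(m))  # directional derivative at s = 0
    curv = sum(d[i] * Kd[i] for i in range(m))   # curvature d^T K d
    if curv <= 0.0:
        return upper if slope < 0.0 else 0.0
    return min(max(-slope / curv, 0.0), upper)


def partan_step(K, alpha, alpha_prev):
    """One PARTAN-FW iteration; returns (new iterate, intermediate FW point)."""
    m = len(alpha)
    grad = [sum(K[i][j] * alpha[j] for j in range(m)) for i in range(m)]
    i_star = min(range(m), key=lambda i: grad[i])
    # Intermediate FW step towards the vertex e_{i_star}.
    d_fw = [-a for a in alpha]
    d_fw[i_star] += 1.0
    lam = exact_step(K, alpha, d_fw, 1.0)
    tilde = [alpha[i] + lam * d_fw[i] for i in range(m)]
    # Extra search direction through the previous iterate.
    d_pt = [tilde[i] - alpha_prev[i] for i in range(m)]
    # Largest extrapolation that keeps every coordinate nonnegative (assumption).
    mu_max = min((tilde[i] / -d_pt[i] for i in range(m) if d_pt[i] < 0.0), default=0.0)
    mu = exact_step(K, tilde, d_pt, mu_max)
    new = [tilde[i] + mu * d_pt[i] for i in range(m)]
    return new, tilde


K = [[2.0, 0.5, 0.0],
     [0.5, 1.0, 0.2],
     [0.0, 0.2, 3.0]]
alpha_prev = [1.0, 0.0, 0.0]
alpha = [0.3, 0.7, 0.0]
new, tilde = partan_step(K, alpha, alpha_prev)
```

Because a zero extrapolation stepsize is always admissible, the PARTAN point is never worse in objective value than the intermediate FW point, and both stepsizes come from closed formulas in the quadratic case.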

A geometrical interpretation of the PARTAN method can be obtained by
looking at the typical behaviour of a standard FW iteration near a solution: the
fact that the search direction of the FW method tends to become orthogonal
to $\nabla f(\alpha^{(k)})$ close to the optimum can easily lead to a zigzagging trajectory, as
seen from Figure 2(a). A simple way to circumvent this behaviour consists in
performing an extra line-search along the line connecting $\alpha^{(k-1)}$ to the
intermediate point $\tilde\alpha^{(k)}$ of (7) (which corresponds to a basic FW step from
$\alpha^{(k)}$). The case depicted in Figure 2(b)
shows how PARTAN is able to avoid traversing the “sawtooth” in the trajectory,
directly moving towards a point closer to the solution. It is apparent how this
approach is especially advantageous if the stepsizes can be computed by a closed
formula, as is the case, e.g., for quadratic objective functions.

Figure 1: Sketch of the search directions used by PARTAN iterations.

Figure 2: A geometrical interpretation of the PARTAN-FW iteration.

When specialized to the SVM problem (5), the algorithm assumes a simpler
form, as the key steps in each iteration can be performed analytically. The
necessary formulas, which are obtained via elementary algebraic manipulations,
are reported in the Appendix. For the purposes of the discussion here, it
suffices to mention that the cost per iteration of the PARTAN method is nearly
equivalent to that of the standard FW, as also demonstrated by the numerical
results in the next section. Regarding the stopping criterion, the duality gap
can be conveniently computed as

    $$\Delta_d(\alpha^{(k)}) = 2 f(\alpha^{(k)}) - \nabla f(\alpha^{(k)})_{i_*^{(k)}}. \qquad (9)$$
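
For completeness, the closed form of the duality gap on the unit simplex can be derived in two lines. The sketch below assumes the quadratic objective $f(\alpha) = \frac{1}{2}\alpha^T K \alpha$ of problem (5), the minimization convention of criterion (3), and the index $i_*$ of the smallest gradient component as in (6):

```latex
% Gradient of the quadratic objective and the resulting identity:
\nabla f(\alpha) = K\alpha
  \quad\Longrightarrow\quad
  \alpha^T \nabla f(\alpha) = \alpha^T K \alpha = 2 f(\alpha).

% A linear function over the unit simplex attains its minimum at a vertex e_i:
\max_{u \in \Sigma} \, (\alpha - u)^T \nabla f(\alpha)
  = \alpha^T \nabla f(\alpha) - \min_{1 \le i \le m} \nabla f(\alpha)_i
  = 2 f(\alpha) - \nabla f(\alpha)_{i_*}.
```

The gap therefore costs nothing beyond the vertex search itself, provided the function value is tracked along the iterations.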


We summarize the overall procedure in Algorithm 2.

Algorithm 2 The PARTAN-FW algorithm for problem (5).

 1: Input: an initial estimate $\alpha^{(0)}$ and a tolerance $\varepsilon$.
 2: Compute $\alpha^{(1)}$ via a standard FW step.
 3: Search for $i_*^{(1)} \in \operatorname{argmin}_i \nabla f(\alpha^{(1)})_i$.
 4: Initialize the duality gap $\Delta_d(\alpha^{(1)})$ as in (9).
 5: Set $k = 1$.
 6: while $\Delta_d(\alpha^{(k)}) > \varepsilon$ do
 7:   Compute the optimal FW steplength $\lambda^{(k)}$ (see Appendix).
 8:   Compute the function value after the intermediate FW step (7) (see Appendix).
 9:   Compute $w_k$ (see Appendix).
10:   Compute the optimal PARTAN steplength $\mu^{(k)}$ (see Appendix).
11:   Perform the PARTAN step (8) as:

        $$\alpha^{(k+1)} = (1 + \mu^{(k)} - \lambda^{(k)} - \mu^{(k)}\lambda^{(k)})\,\alpha^{(k)} - \mu^{(k)}\alpha^{(k-1)} + (\lambda^{(k)} + \mu^{(k)}\lambda^{(k)})\,u^{(k)}.$$

12:   Update the function value (see Appendix).
13:   Set $k := k + 1$.
14:   Search for $i_*^{(k)} \in \operatorname{argmin}_i \nabla f(\alpha^{(k)})_i$.
15:   Update the duality gap $\Delta_d(\alpha^{(k)})$ as in (9).
16: end while


4 Numerical Results 

In this section, we assess the performance of all the considered variants of FW
on the binary classification problem (5), using the benchmark datasets listed in
Table 1. The numbers of examples in the training set and test set are denoted
by $m$ and $t$, respectively, while $n$ denotes the number of features.


Dataset          m          t         n
Adult a9a        32,561     16,281    123
Web w8a          49,749     14,951    300
IJCNN1           49,990     91,701    22
USPS-ext         266,079    75,383    675
KDD99-binary     395,216    98,805    38
RCV1-binary      677,399    20,242    47,236

Table 1: List of the benchmark datasets for problem (5).

All the experiments are performed with an RBF kernel. Due to the size
of the datasets, the SVM regularization parameter is selected by a simple
approach, where a single validation set is built by randomly extracting 70% of
the training examples, and the remaining 30% is reserved for testing^2. For the
RCV1 dataset, we used the value suggested in [26]. The kernel width is selected
according to a heuristic from the literature. The algorithms are coded in C++ and run in
Linux on a 3.40 GHz Intel i7 machine with 16 GB of main memory.

^2 The same values of the hyper-parameters are used for all the methods.

In the first experiment, we set $\varepsilon = 10^{-4}$ in the stopping criterion (3) and
evaluate the performance of the proposed methods in terms of test accuracy,
CPU time (in seconds), number of iterations and model size (number of SVs).
Results are reported in Table 2.



Dataset         Measure     FW          MFW         SWAP        PARTAN
Adult a9a       Acc (%)     84.21       83.29       83.53       84.00
                Time (s)    1.58e+02    1.57e+02    2.26e+02    1.07e+02
                Iter        2.02e+04    2.00e+04    1.76e+04    1.34e+04
                SVs         1.39e+04    1.28e+04    1.41e+04    1.18e+04
Web w8a         Acc (%)     99.30       99.32       99.28       99.30
                Time (s)    3.78e+02    3.16e+02    3.56e+02    1.07e+02
                Iter        1.65e+04    1.38e+04    9.24e+03    4.62e+03
                SVs         6.92e+03    4.48e+03    4.97e+03    2.83e+03
IJCNN1          Acc (%)     98.50       98.22       98.40       98.36
                Time (s)    5.13e+01    4.57e+01    5.09e+01    1.98e+01
                Iter        1.59e+04    1.41e+04    1.35e+04    5.48e+03
                SVs         3.23e+03    2.73e+03    3.16e+03    2.73e+03
USPS-ext        Acc (%)     99.52       99.52       99.53       99.52
                Time (s)    1.98e+03    8.44e+02    9.46e+02    7.38e+02
                Iter        2.15e+04    9.17e+03    8.10e+03    7.79e+03
                SVs         3.93e+03    3.64e+03    3.67e+03    3.51e+03
KDD99-binary    Acc (%)     99.94       99.93       99.94       99.93
                Time (s)    7.02e+02    5.26e+02    5.51e+02    1.89e+02
                Iter        1.71e+04    1.27e+04    7.82e+03    4.37e+03
                SVs         5.25e+03    3.63e+03    3.86e+03    2.82e+03
RCV1-binary     Acc (%)     97.55       96.64       97.17       97.50
                Time (s)    1.37e+04    1.36e+04    1.71e+04    1.23e+04
                Iter        3.77e+04    3.81e+04    3.65e+04    3.38e+04
                SVs         3.75e+04    3.58e+04    3.65e+04    3.37e+04

Table 2: Comparison of different variants of FW on benchmark SVM datasets.


It can be seen that all the algorithms generally exhibit a good performance. 
In particular, the PARTAN variant in Algorithm 2 yields the most consistent
results, improving on the running times of the plain FW by a factor of 2.52 on 
average. Results relative to test accuracy and model sizes are fairly stable, with 
no particular variant outperforming the others in most cases, though it should 
be noted that PARTAN is often able to find a smaller SV set. This means that 
the number of spurious points (i.e. active examples which are not part of the 
true SV set) selected by the FW iterations is potentially reduced, as especially 
evident on the Web w8a dataset. 
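
The average speedup quoted above can be reproduced directly from the CPU times of the FW and PARTAN columns in Table 2. The short script below is only a convenience check; the dictionary layout and variable names are choices made for this example, and the figure is the plain arithmetic mean of the per-dataset time ratios.

```python
# CPU times in seconds, copied from the FW and PARTAN columns of Table 2.
fw_time = {
    "Adult a9a": 1.58e2,
    "Web w8a": 3.78e2,
    "IJCNN1": 5.13e1,
    "USPS-ext": 1.98e3,
    "KDD99-binary": 7.02e2,
    "RCV1-binary": 1.37e4,
}
partan_time = {
    "Adult a9a": 1.07e2,
    "Web w8a": 1.07e2,
    "IJCNN1": 1.98e1,
    "USPS-ext": 7.38e2,
    "KDD99-binary": 1.89e2,
    "RCV1-binary": 1.23e4,
}

# Per-dataset speedup of PARTAN over plain FW, then the arithmetic mean.
speedups = {name: fw_time[name] / partan_time[name] for name in fw_time}
average_speedup = sum(speedups.values()) / len(speedups)
print(round(average_speedup, 2))  # prints 2.52
```

The individual ratios range from about 1.1 on RCV1-binary to about 3.7 on KDD99-binary, so the average hides a substantial dependence on the dataset.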

We can also see how the reduction in computational time for PARTAN-FW
is roughly proportional to the decrease in the number of iterations, which
confirms our intuition that using the PARTAN algorithm on SVMs does not
imply a higher cost per iteration than that of the standard FW^3. Seeing
how this technique provides a systematic speedup with no evident drawbacks
when compared to the standard FW, we recommend it over the latter for
large-scale SVM problems.

A potentially relevant observation is that the benefit of using the
PARTAN-accelerated iteration is related to a good extent to the sparsity of the solution.
The advantage is indeed more apparent on problems where the size of the SV set 
is a small fraction of the total number of examples, with the KDD99-binary 
dataset being a prominent example. 

On the other hand, the FW variants show no advantage over the standard
algorithm on the RCV1-binary problem. This is arguably because the number
of SVs is basically the same as the total number of iterations. Since the FW
algorithm spends all of its iterations adding new vertices (i.e. examples
corresponding to nonzero components of $\alpha^{(k)}$) to the model, the usual slowdown
behaviour of FW, where the algorithm cycles between the same vertices
readjusting their weights, is not observed. As such, there is little benefit in adding
modified FW directions. The same phenomenon is observed, on a smaller scale,
on the Adult a9a dataset.

This is consistent with the fact that FW methods are best used to solve 
sparse problems, as also suggested by their theoretical properties. Conversely, 
their usefulness is more limited when the solution is dense, as the incremental 
nature of the algorithm provides no particular advantage in this case. 

As far as the difference in performance between the variants is concerned,
we remark that the theoretical results in Section 2 are of asymptotic nature,
and as such it is difficult to assess their practical impact with a fixed value of
$\varepsilon$, which might be too large to observe a faster convergence compared to the
standard FW. We investigate this issue in the next paragraph.

4.1 Considerations on the Iteration Complexity 

We now attempt to better assess the practical difference between the standard
FW and its variants, in order to understand how and when the latter can give a
substantial advantage. In particular, we want to establish whether the improved
convergence predicted by the theory can be observed experimentally when using
a duality gap-based stopping criterion. To this end, we apply all the considered
variants of FW to the datasets Adult a9a, Web w8a and IJCNN1, using
increasingly strict tolerances $\varepsilon \in \{10^{-3}, \ldots, 10^{-6}\}$, and monitoring the number
of iterations needed to trigger the stopping condition. We do not attempt to
solve the larger scale problems here, as the smallest value of $\varepsilon$ would lead to
prohibitive running times, and we remark that this experiment aims exclusively
at providing an insight on the convergence speed of the algorithms. Results are
shown in Table 3.

^3 Note that this also holds true for the other variants [7].

Dataset      Method    Measure     ε = 1e-03   ε = 1e-04   ε = 1e-05   ε = 1e-06
Adult a9a    FW        Time (s)    2.24e+01    1.58e+02    1.46e+03    1.42e+04
                       Iter        2.82e+03    2.02e+04    1.84e+05    1.79e+06
             MFW       Time (s)    2.20e+01    1.57e+02    5.48e+02    1.21e+03
                       Iter        2.77e+03    2.00e+04    6.80e+04    1.50e+05
             SWAP      Time (s)    3.07e+01    2.26e+02    6.46e+02    1.30e+03
                       Iter        2.61e+03    1.76e+04    5.18e+04    1.09e+05
             PARTAN    Time (s)    1.73e+01    1.07e+02    5.09e+02    5.43e+03
                       Iter        2.15e+03    1.64e+04    6.21e+04    6.63e+05
Web w8a      FW        Time (s)    4.47e+01    3.78e+02    3.73e+03    4.02e+04
                       Iter        1.93e+03    1.65e+04    1.63e+05    1.75e+06
             MFW       Time (s)    4.27e+01    3.16e+02    1.52e+03    3.78e+03
                       Iter        1.86e+03    1.38e+04    6.38e+04    1.65e+05
             SWAP      Time (s)    6.42e+01    3.56e+02    1.31e+03    8.63e+03
                       Iter        1.69e+03    9.24e+03    4.29e+04    3.46e+05
             PARTAN    Time (s)    1.83e+01    1.07e+02    5.97e+02    3.32e+03
                       Iter        7.77e+02    4.62e+03    2.57e+04    1.44e+05
IJCNN1       FW        Time (s)    5.77e+00    5.13e+01    5.49e+02    6.48e+03
                       Iter        1.72e+03    1.59e+04    1.68e+05    1.97e+06
             MFW       Time (s)    5.16e+00    4.57e+01    2.38e+02    6.81e+02
                       Iter        1.51e+03    1.41e+04    7.12e+04    2.05e+05
             SWAP      Time (s)    6.94e+00    5.09e+01    2.56e+02    6.56e+02
                       Iter        1.21e+03    1.35e+04    6.85e+04    1.77e+05
             PARTAN    Time (s)    3.09e+00    1.98e+01    1.60e+02    1.83e+03
                       Iter        7.83e+02    5.48e+03    4.34e+04    4.99e+05

Table 3: Iteration complexity of different variants of FW.


From the results, it is clear how the standard FW behaves according to the
$O(1/\varepsilon)$ iteration bound, with the number of iterations increasing 10-fold every
time the tolerance parameter decreases by one order of magnitude. This
corresponds to the duality gap decreasing as $O(1/k)$, as predicted by Proposition 2.
This result suggests that, though (4) is an upper bound of the duality gap (and
thus in turn of the primal gap), it gives in practice a good indication of the
number of iterations that we can expect from the standard FW algorithm, therefore
implying that the computational effort can be predicted and controlled by
appropriately tuning the tolerance parameter. The modified variants, in contrast,
enjoy a faster convergence rate, ending up gaining a computational advantage of
one order of magnitude or more with respect to the plain FW when the strictest
tolerance value is used. It is interesting to note how the three variants analyzed
here do not always provide the same improvement. As the results presented in
this work are only preliminary, it is difficult to establish whether this is
simply due to the methods having different convergence factors (e.g. because they
enjoy convergence rates with the same asymptotic behaviour but differing by a
constant) or is intrinsically related to the nature of the algorithms (for
example, PARTAN starts out with a substantial advantage over the other variants
at $\varepsilon = 10^{-3}$, but is outperformed by MFW when seeking a more accurate
solution).
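
One simple way to read Table 3 is to look at the factor by which the iteration count grows each time the tolerance is tightened by one order of magnitude: a factor close to 10 is what an O(1/ε) method produces, while smaller factors indicate faster convergence. The sketch below computes these factors for the Adult a9a rows of Table 3; the helper name `growth_factors` and the data layout are illustrative choices.

```python
# Iteration counts on Adult a9a for eps = 1e-3, 1e-4, 1e-5, 1e-6 (from Table 3).
iterations = {
    "FW": [2.82e3, 2.02e4, 1.84e5, 1.79e6],
    "MFW": [2.77e3, 2.00e4, 6.80e4, 1.50e5],
    "SWAP": [2.61e3, 1.76e4, 5.18e4, 1.09e5],
    "PARTAN": [2.15e3, 1.64e4, 6.21e4, 6.63e5],
}


def growth_factors(counts):
    """Ratio between consecutive iteration counts as the tolerance tightens 10-fold."""
    return [later / earlier for earlier, later in zip(counts, counts[1:])]


factors = {name: growth_factors(counts) for name, counts in iterations.items()}
for name, values in factors.items():
    print(name, [round(v, 2) for v in values])
```

On this dataset the plain FW factors stay above 7 and approach 10, whereas MFW and SWAP drop to roughly 2 at the strictest tolerance.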

We can attempt to shed some more light on this issue by plotting in Figure
3 the duality gap (in logarithmic scale), obtained with $\varepsilon = 10^{-6}$, against the
iteration number for all the variants of the algorithm. From the graphs, it
can be seen how the MFW algorithm seems to exhibit the best primal-dual
convergence rate, with oscillations in the duality gap being very small^4. The
SWAP algorithm performs even better on two of the datasets (though exhibiting
larger oscillations), but substantially worse on Web w8a. The PARTAN variant
seems instead to show a behaviour similar to that of the standard FW, but with
a better convergence factor, an observation which is consistent with the results
obtained in Tables 2 and 3. It might also be worth noticing that the duality gap is only
an upper bound of the optimality measure $f(\alpha^{(k)}) - f(\alpha^*)$, thus occasional
oscillating values of the duality gap do not imply that the solution is getting
less accurate.

Overall, though we are well aware that a more representative batch of
problems would be needed to draw more solid conclusions, the results in Table 3
show that the considered FW variants indeed display a faster convergence rate
in practice using a primal-dual stopping criterion. It is not obvious, however,
whether the results on the duality gap given by Proposition 2 can be improved
under suitable hypotheses. They also show how the traditional FW is not a
suitable method if one wants to use stricter values of $\varepsilon$, for example because the
application at hand requires a higher optimization accuracy. This confirms on a
Machine Learning problem the well-known intuition that the standard FW step
stagnates when close to a solution, unless extra search directions which do not
get orthogonal to the gradient are added.

It should be noted, indeed, that a good choice of $\varepsilon$ is application-dependent.
In SVMs for classification, for instance, it is well known that the test accuracy
is often relatively insensitive to $\varepsilon$ after a certain threshold. On the other hand,
different applications, such as function estimation, could be more sensitive to
the accuracy (in an optimization sense) of the obtained model. Therefore, while
they may not appear very relevant in the context of classification SVMs, the
improved properties of some FW modifications may be of importance for other
related tasks. In this case, we would recommend the use of a FW variant rather
than the standard algorithm.

4.2 Results with Randomized Iterations 

As the total number of iterations required by a FW algorithm can be large,
devising a convenient way to solve the subproblem (2) is recommended in order
to make the algorithm more viable on large-scale datasets. A typical situation
arises when (2) has an analytical solution or is easy to solve due to the problem
structure. This is the case, for example, for all the problems introduced
in Section 2.3. Still, the resulting complexity usually depends on the problem
size (for example, in (6) it is proportional to $m$), and can thus be impractical
when handling large-scale data.

Figure 3: Duality gap behaviour of FW algorithms for $\varepsilon = 10^{-6}$ on the datasets
Adult a9a (a), Web w8a (b), and IJCNN1 (c).

^4 It is important to observe that, as opposed to the primal gap $f(\alpha^{(k)}) - f(\alpha^*)$,
the duality gap is not a monotonically decreasing quantity.

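A simple and yet effective way to avoid the dependence on $m$ is to look for
the solution of (2) by exploring only a fixed number of extreme points on the
boundary of $\Sigma$. In the case of (6), for example, this means extracting
a sample $\mathcal{S} \subset \{1, \ldots, m\}$ and solving

    $$i_{\mathcal{S}}^{(k)} \in \operatorname{argmin}_{i \in \mathcal{S}} \nabla f(\alpha^{(k)})_i. \qquad (10)$$

The cost of an iteration becomes in this case $O(|\mathcal{S}||\mathcal{I}^{(k)}|)$, rather than $O(m|\mathcal{I}^{(k)}|)$
as in (6)^5, where $\mathcal{I}^{(k)}$ is the set of indices of the nonzero components of $\alpha^{(k)}$
introduced in Section 2.3.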

The stopping criterion, however, is not applicable without computing the
entire gradient $\nabla f(\alpha^{(k)})$, which is not done in the randomized case. As a possible
alternative, we can use the approximate quantity

    $$\Delta_{\mathcal{S}}(\alpha^{(k)}) := 2 f(\alpha^{(k)}) - \nabla f(\alpha^{(k)})_{i_{\mathcal{S}}^{(k)}},$$

where $i_{\mathcal{S}}^{(k)}$ is the index of the smallest gradient component within the sample.
Since $\Delta_{\mathcal{S}}(\alpha^{(k)})$ never exceeds the exact duality gap, this simplification entails a
tradeoff between the reduction in computational cost and the risk of a premature stopping.
Although this can
be acceptable in contexts such as SVM classification, where solving the
optimization problem with a high accuracy is usually not needed, it is important to
make sure that the impact of this approximation is kept to an acceptable level.
This issue has been discussed in detail in the literature. Here, to mitigate the effect of
a possible early stopping, we implement a simple safeguard strategy where the
sampling is repeated twice in case $\Delta_{\mathcal{S}}(\alpha^{(k)}) \le \varepsilon$.
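
The relation between the sampled quantity and the exact duality gap can be illustrated with a few lines of code. The sketch below assumes the quadratic objective f(a) = 0.5 a^T K a over the unit simplex, computes the full gradient only for the sake of comparison (a randomized solver would evaluate just the sampled components), and uses a made-up kernel-like matrix; the function name `sampled_and_full_gap` is hypothetical.

```python
import random


def sampled_and_full_gap(K, alpha, sample_size, rng):
    """Return (sampled gap, exact duality gap) for f(a) = 0.5 * a^T K a on the simplex."""
    m = len(alpha)
    # Full gradient, computed here only to compare the two quantities.
    grad = [sum(K[i][j] * alpha[j] for j in range(m)) for i in range(m)]
    two_f = sum(a * g for a, g in zip(alpha, grad))  # alpha^T K alpha = 2 f(alpha)
    full_gap = two_f - min(grad)
    # Randomized vertex search: smallest gradient entry within a random index sample.
    sample = rng.sample(range(m), sample_size)
    i_s = min(sample, key=lambda i: grad[i])
    sampled_gap = two_f - grad[i_s]
    return sampled_gap, full_gap


m = 30
# Illustrative symmetric matrix with entries decaying away from the diagonal.
K = [[1.0 / (1 + abs(i - j)) for j in range(m)] for i in range(m)]
alpha = [1.0 / m] * m
rng = random.Random(0)
pairs = [sampled_and_full_gap(K, alpha, 5, rng) for _ in range(20)]
```

Because the minimum over a subset can never be smaller than the minimum over all indices, the sampled value never exceeds the exact gap, which is precisely why a test based on it may stop too early.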

In Table 4 we report the results obtained with a randomization technique,
taking $|\mathcal{S}| = 194$^6, averaged over 10 runs. Note that we do not attempt to
run the randomization technique on problems for which this strategy is not
beneficial. Taking RCV1-binary as an example, it is already clear, from the
structure of the dataset and the results in Table 2, that using a random sampling
would not provide any advantage: the SV set size being of order $10^4$, an iteration
would have a complexity in the order of millions of floating point operations,
which is actually much larger than the size of the whole dataset. In general, we
do not recommend using a random sampling for problems with dense SV sets,
as in order to obtain a computational gain the number of samples would have
to be too small, possibly leading to an inaccurate solution.
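
The sample size of 194 used above can be recovered with a one-line calculation, under the assumption (made here for illustration) that the sampled indices are drawn independently and uniformly: the probability that none of s draws falls among the best 2% of the gradient components is 0.98^s, so we need the smallest s for which this drops below 0.02. The function name `min_sample_size` is hypothetical.

```python
import math


def min_sample_size(confidence, fraction):
    """Smallest s such that 1 - (1 - fraction)**s >= confidence (independent draws)."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - fraction))


size = min_sample_size(0.98, 0.02)
print(size)  # prints 194
```

Notably, the resulting sample size does not depend on the number of examples m, which is what makes the cost of the randomized vertex search independent of the dataset size.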

First of all, note that the effect of sampling is substantially problem-dependent.
The best computational gains are obtained on the problems Web w8a, USPS-ext
and KDD99-binary, with the latter two being the largest and most sparse
datasets. It should be noted that the reduction in CPU time is attributable both
to the reduced cost per iteration and to the smaller iteration count, the latter
being due to the approximate stopping criterion employed. This is not observed
on all the datasets, however. For example, the total number of iterations on
Adult a9a and IJCNN1 is comparable to that of the deterministic case. The
MFW algorithm also appears to be overall less sensitive to this particular issue.

It is interesting to note that this phenomenon does not necessarily lead to a
loss in test accuracy. This is possibly due to the nature of the SVM
classification models, which do not require a very accurate solution of the
optimization problem to build a decision function with a good predictive
capability.

®It should also be noted that a clever implementation allows to eliminate the |lVl| factor 
from in the SVM case m- 

The value $|\mathcal{S}| = 194$ corresponds to a probability of at least 0.98
that the selected index lies in the 2% smallest components of
$\nabla f(\alpha^{(k)})$. See Theorem 6.33 in [28].

Dataset        Metric     FW         MFW        SWAP       PARTAN
----------------------------------------------------------------------
Adult a9a      Acc (%)    83.94      83.86      83.52      84.08
               Time (s)   1.69e+02   1.63e+02   1.64e+02   1.08e+02
               Iter       1.89e+04   1.98e+04   1.65e+04   1.24e+04
               SVs        1.35e+04   1.24e+04   1.35e+04   1.11e+04

Web w8a        Acc (%)    99.17      99.21      99.16      99.10
               Time (s)   8.39e+01   6.69e+01   7.12e+01   4.89e+01
               Iter       6.78e+03   9.10e+03   4.97e+03   3.25e+03
               SVs        3.66e+03   3.01e+03   3.19e+03   2.31e+03

IJCNN1         Acc (%)    98.57      97.98      98.34      98.37
               Time (s)   3.45e+01   2.10e+01   2.21e+01   1.95e+01
               Iter       1.12e+04   1.14e+04   7.10e+03   5.76e+03
               SVs        4.12e+03   2.45e+03   3.43e+03   3.34e+03

USPS-ext       Acc (%)    99.53      99.57      99.55      99.53
               Time (s)   2.19e+02   2.60e+02   2.25e+02   2.07e+02
               Iter       3.61e+03   6.49e+03   3.59e+03   3.28e+03
               SVs        2.96e+03   2.48e+03   2.94e+03   2.86e+03

KDD99-binary   Acc (%)    99.73      99.93      99.78      99.82
               Time (s)   2.84e+01   9.46e+01   2.03e+01   1.20e+01
               Iter       1.88e+03   1.29e+04   1.57e+03   1.15e+03
               SVs        1.71e+03   2.35e+03   1.45e+03   1.11e+03

Table 4: Comparison of different variants of FW on benchmark SVM datasets 
(randomized iteration). 
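As a reading aid for Table 4, the short sketch below computes the ratio between the PARTAN and FW running times on each dataset. The figures are the Time entries of the table after repairing the OCR damage in the exponents; the variable names are illustrative only.

```python
# Time (s) from Table 4, in the column order FW, MFW, SWAP, PARTAN.
times = {
    "Adult a9a":    (1.69e2, 1.63e2, 1.64e2, 1.08e2),
    "Web w8a":      (8.39e1, 6.69e1, 7.12e1, 4.89e1),
    "IJCNN1":       (3.45e1, 2.10e1, 2.21e1, 1.95e1),
    "USPS-ext":     (2.19e2, 2.60e2, 2.25e2, 2.07e2),
    "KDD99-binary": (2.84e1, 9.46e1, 2.03e1, 1.20e1),
}

# Ratio below 1 means PARTAN was faster than plain FW on that dataset.
ratios = {name: t[3] / t[0] for name, t in times.items()}

for name, r in ratios.items():
    print(f"{name:13s} PARTAN/FW time ratio: {r:.2f}")
```

On these figures the ratio is below one on every dataset, with the smallest gain on USPS-ext and the largest on KDD99-binary.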


5 Conclusions 

The results presented in this paper show that the family of FW algorithms
obtains promising results on several benchmark SVM classification tasks,
offering a solid and fast alternative to the classical solvers used in this
field.

While the experimental results presented here are preliminary, they provide 
the first example of a successful application of the PARTAN-FW iteration to 
Machine Learning problems, showing that this variant is able to accelerate the 
basic FW iteration in a systematic way. On the other hand, the advantage of 
other modified FW algorithms is especially apparent when employing stricter 
tolerance parameters, arguably due to the stronger theoretical properties of the 
enhanced iterations. 

Finally, we have shown how, on larger scale problems, a randomization
technique can be employed to reduce the computational effort with satisfactory
results, with some caveats related to the tradeoff between complexity and
optimization accuracy which is inherent to this kind of strategy.

Experiments on different machine learning applications (such as Lasso and
matrix recovery problems) are currently under way, and will be the subject of
another paper.
Appendix: Implementation Details


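We report here, for the sake of completeness, the analytical formulas used in
the implementation of Algorithm 5 for the SVM problem. To simplify the
equations, we use the shorthand notations $f^{(k)} = f(\alpha^{(k)})$ and
$\nabla f^{(j)}_{*} = \nabla f(\alpha^{(j)})_{i_*}$, where $i_*$ denotes the
index of the vertex selected at iteration $k$.

After some elementary algebraic manipulations, we obtain that the optimal
steplength value for the FW step is given by

$$\lambda^{(k)} = \frac{2 f^{(k)} - \nabla f^{(k)}_{*}}{2 f^{(k)} - 2 \nabla f^{(k)}_{*} + K_{i_* i_*}}. \tag{11}$$

After this step, the objective value becomes

$$\tilde{f} = (1 - \lambda^{(k)})^2 f^{(k)} + \lambda^{(k)} (1 - \lambda^{(k)}) \nabla f^{(k)}_{*} + \tfrac{1}{2} (\lambda^{(k)})^2 K_{i_* i_*}. \tag{12}$$

The steplength for the PARTAN step is then given by

$$\mu^{(k)} = \frac{\lambda^{(k)} \nabla f^{(k-1)}_{*} - (1 - \lambda^{(k)}) w^{(k)} - 2 \tilde{f}}{2 \left( \tilde{f} + (1 - \lambda^{(k)}) w^{(k)} - \lambda^{(k)} \nabla f^{(k-1)}_{*} + f^{(k-1)} \right)}, \tag{13}$$

where

$$w^{(k+1)} = -2 (1 + \mu^{(k)}) (1 - \lambda^{(k)}) f^{(k)} - (1 + \mu^{(k)}) \lambda^{(k)} \nabla f^{(k)}_{*} - \mu^{(k)} w^{(k)} \tag{14}$$

is a quantity that can be computed recursively from its initial value.
Finally, the updated objective value after the PARTAN iteration is

$$f^{(k+1)} = (1 + \mu^{(k)})^2 \tilde{f} + \mu^{(k)} (1 + \mu^{(k)}) \left[ (1 - \lambda^{(k)}) w^{(k)} - \lambda^{(k)} \nabla f^{(k-1)}_{*} \right] + (\mu^{(k)})^2 f^{(k-1)}. \tag{15}$$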
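The closed-form updates (11)-(15) can be checked numerically on a small random problem. The sketch below rests on several assumptions that are not stated in the appendix itself: the objective is taken as $f(\alpha) = \frac{1}{2} \alpha^T K \alpha$ with a symmetric positive definite $K$ (a different scaling would only change the terms involving the diagonal entry of $K$), the PARTAN steplength is written $\mu$, the PARTAN update is $\alpha^{(k+1)} = \tilde{\alpha} + \mu (\tilde{\alpha} - \alpha^{(k-1)})$, and the recursive quantity is identified with $w^{(k)} = -\alpha^{(k)T} \nabla f(\alpha^{(k-1)})$, which is inferred from the recursion. All function names are hypothetical, and the feasibility clipping of $\mu$ that a complete solver would need is deliberately left out.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(K, a):
    return [dot(row, a) for row in K]


def f(K, a):
    # Assumed objective: f(alpha) = 0.5 * alpha^T K alpha.
    return 0.5 * dot(a, matvec(K, a))


def partan_fw_update(K, a_prev, a, f_prev, f_cur, w):
    """One FW step plus one PARTAN step, using closed-form updates only."""
    g = matvec(K, a)                               # gradient at alpha^(k)
    i = min(range(len(a)), key=g.__getitem__)      # FW vertex index
    g_star, k_star = g[i], K[i][i]
    g_prev_star = dot(K[i], a_prev)                # same entry at alpha^(k-1)

    d = 2 * f_cur - 2 * g_star + k_star
    lam = (2 * f_cur - g_star) / d if d > 0 else 0.0          # FW steplength
    lam = min(1.0, max(0.0, lam))                  # keep the FW step feasible
    f_fw = ((1 - lam) ** 2 * f_cur + lam * (1 - lam) * g_star
            + 0.5 * lam ** 2 * k_star)             # objective after FW step

    c = (1 - lam) * w - lam * g_prev_star
    den = 2 * (f_fw + c + f_prev)
    mu = -(2 * f_fw + c) / den if den > 0 else 0.0            # PARTAN steplength
    w_next = (-2 * (1 + mu) * (1 - lam) * f_cur
              - (1 + mu) * lam * g_star - mu * w)             # recursion for w
    f_next = (1 + mu) ** 2 * f_fw + mu * (1 + mu) * c + mu ** 2 * f_prev

    a_fw = [(1 - lam) * x for x in a]
    a_fw[i] += lam
    a_next = [(1 + mu) * x - mu * y for x, y in zip(a_fw, a_prev)]
    return a_next, f_fw, f_next, w_next


def random_simplex_point(n, rng):
    x = [rng.random() for _ in range(n)]
    s = sum(x)
    return [v / s for v in x]


rng = random.Random(0)
n = 6
B = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
cols = [[B[r][j] for r in range(n)] for j in range(n)]
K = [[dot(ci, cj) for cj in cols] for ci in cols]  # K = B^T B
for j in range(n):
    K[j][j] += 1.0                                 # ridge keeps K well conditioned
a_prev, a = random_simplex_point(n, rng), random_simplex_point(n, rng)
w = -dot(a, matvec(K, a_prev))                     # inferred definition of w^(k)

a_next, f_fw, f_next, w_next = partan_fw_update(
    K, a_prev, a, f(K, a_prev), f(K, a), w)
err_f = abs(f_next - f(K, a_next))                 # closed form vs direct evaluation
err_w = abs(w_next - (-dot(a_next, matvec(K, a))))
print(err_f, err_w)
```

Both discrepancies should be at the level of floating-point round-off, which is a useful sanity check that the recursive bookkeeping of the objective value and of $w$ avoids any full evaluation of $f$ during the iterations.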